SiGra: single-cell spatial elucidation through an image-augmented graph transformer

Recent advances in high-throughput molecular imaging have pushed spatial transcriptomics technologies to subcellular resolution, which surpasses the limitations of both single-cell RNA-seq and array-based spatial profiling. The multichannel immunohistochemistry images in such data provide rich information on the cell types, functions, and morphologies of cellular compartments. In this work, we developed a method, single-cell spatial elucidation through image-augmented Graph transformer (SiGra), to leverage such imaging information for revealing spatial domains and enhancing substantially sparse and noisy transcriptomics data. SiGra applies hybrid graph transformers over a single-cell spatial graph. SiGra outperforms state-of-the-art methods on both single-cell and spot-level spatial transcriptomics data from complex tissues. The inclusion of immunohistochemistry images improves the model performance by 37% (95% CI: 27–50%). SiGra improves the characterization of intratumor heterogeneity and intercellular communication and recovers the known microscopic anatomy. Overall, SiGra effectively integrates different spatial modality data to gain deep insights into spatial cellular ecosystems.

), Accuracy of identifying spatial domains based on 10 simulation replicates over a range of dropout levels (10 samples for each).Clustering accuracy is quantified using ARI.In the boxplot, the center line, box limits and whiskers denote the median, upper and lower quartiles, and 1.5× interquartile range, respectively.Source data are provided as a Source Data file.
Here we demonstrate that the enhanced data by SiGra provides more information than raw data and highlight the necessity of gene expression enhancement for single-cell spatial transcriptomics (SCST) data.

1). NanoString CosMx Lung cancer slice
Specifically, for the lung cancer NanoString CosMx data in Fig. 2, we compared both enhanced data and raw data with existing bulk RNA-seq data.Here we used the bulk RNA-seq data from TCGA lung cancer patients.As shown in Supplementary Fig. 2a, the x-axis and y-axis represented the total log-transformed counts per gene in lung cancer slices between the two types of technologies, i.e., SCST and bulk.Each point represented the RNA count for a single gene, averaged across different experimental samples for the corresponding technology.The RNA counts between enhanced SCST and bulk sequencing showed better concordance (cor = 0.631) than that between raw SCST and bulk data (cor = 0.579).
In addition, we also evaluated the potential false discoveries in the L-R associations (Fig. 3e) using randomized control.Briefly, we assumed that randomly selected gene pairs from the SCST data were not likely associated and thus used as negative controls.By comparing with these negative controls, the false discovery rate of the selected L-R associations was estimated.Briefly, 10,000 gene pairs were randomly selected, and the corresponding Pearson correlations were calculated as negative control.For each of the L-R pair (Fig. 3e), we estimated the false discovery rate accordingly using the FDR values based on the negative controls instead of the Pearson correlations.As shown in the Fig. 3e, the y-axis and x-axis referred to the FDR values of each L-R pair in the enhanced and raw data respectively.Across the total 660 L-R interactions, 55 L-R pairs from the enhanced data were statistically significant (FDR < 0.05), whereas 42 L-R pairs from the raw data had FDR < 0.05.There were 28 L-R pairs shared between enhanced data and raw data, indicating enhanced data preserved useful information of raw data.In addition, 27 specific L-R interactions were identified from the enhanced data, while 14 specific L-R interactions were found in the raw data.In Supplementary Fig. 2b, we further investigated whether these specific L-R interactions pairs were similar with the shared L-R pairs.For those significant L-R pairs identified from the enhanced data, there were no significant difference between the specific and shared L-R pairs, suggesting that both had similar probability of being true associated L-R pairs.In contrast, the raw-specific L-R pairs were statistically different from the shared L-R pairs, suggesting that the raw-specific pairs were more likely to be false discoveries than the shared L-R pairs.These results indicated that the enhanced data not only enabled to detect more L-R interactions than the raw data, but also the identified L-R pairs were more likely to be true discoveries than those specifically detected in raw data.The data enhancement using SiGra not only improved the sensitivity of L-R interaction detection (identifying more L-R pairs), but also preserved the specificity (the specifically identified L-R pairs that had similar statistical significance as the shared L-R pairs).These results indicated that those raw-specific L-R pairs were more likely to be false discoveries, which could result from noises and the low data quality in the raw data.Therefore, SiGra improved both the sensitivity (more identified L-R pairs) and the specificity (more true discoveries) for the detection of L-R interactions.

2). MERSCOPE mouse liver data
For the MERSCOPE mouse liver data in Fig. 4, we visualized the identified cell clusters and enhanced gene expression of Vwf in Supplementary Fig. 2c and Supplementary Fig. 2d, accordingly.We also demonstrated the enhanced data quality by comparing with the bulk RNAseq data from The Tabula Muris Consortium 1 .Supplementary Fig. 2e shows the comparisons between SCST and bulk data for mouse liver MERSCOPE data, where the x-axis and y-axis were the total log-transformed counts.Similarly, each point represents the RNA count for a single gene, averaged across different samples for the corresponding technology.Again, we observed much higher correlation of RNA counts between enhanced SCST and bulk data (cor = 0.854), in contrast with the comparisons between raw SCST and bulk data (cor = 0.800).
In addition, we anticipated that the differential expression analysis also benefited from the enhanced data given its improved data quality.To further verify it, regarding Fig. 4e, we used the single-cell RNA-seq data from the Tabula Muris Consortium 2020 2 to identify the differential expression genes (DEGs) in the cell clusters of hepatocytes, periportal hepatocytes, hepatic stellate cells, and endothelial cells.In this way, we then evaluated the overlaps between the scRNA-seq's DEGs and enhanced data's DEGs, as well as the overlaps between the scRNA-seq's DEGs and the raw data's DEGs.As shown in the Supplementary Fig. 2f, we identified the overlapped DEGs with scRNA-seq for each cluster.The purple-colored bars represent the number of DEGs shared between scRNA-seq and raw data, and the orange-colored bars represented the number of DEGs shared between scRNA-seq and enhanced data.We also labeled the number of DEGs on the bar plot, for example, for C-1, "12 vs 12" referred to "the shared DEGs between scRNA-seq and raw data" vs "the DEGs of raw data", and "49 vs 59" referred to "the shared DEGs between scRNAseq and enhanced data" vs "the DEGs of enhanced data".Across different clusters, the enhanced data was shown to recover more dysregulated genes than the original SCST data.
'Pairwise' model used the same "limma" functions for data processing and taking into account the same correlation structure in addition to using the contrasts.fitfunction provided by "limma".Then we also computed the Student's t-test statistics for each pair of layers.The Student's t-test statistics with double-sided P values for both 'Enrichment' model and 'Pairwise' model were provided in Supplementary Table 3.Our results showed that the SiGra enhanced data showed consistent results with the original study 3 .

Supplementary Note 3: Reveal spatial domains in single-cell spatial profiles
SiGra can identify spatial domains at various resolutions, depending on the data type and the applications.The spot-level spatial data has a low spatial resolution and consists of mixed cells/cell types in each spot.For example, the spatial resolution of the 10x Visium data is 100µm, measured between the centers of two neighboring spots.For such low-resolution data, SiGra directly and accurately reveals the spatial structures such as the anatomic layers in the DLPFC slices (Fig. 5) by clustering the latent-represented spots using Leiden.In contrast, the single-cell spatial data has significantly higher resolution.For example, the spatial resolution of the NanoString CosMx molecular imaging is 52nm, and the summarized gene expression profile based on image segmentation provides single-cell level resolution.SiGra thus can reveal spatial regions at the cellular level (Fig. 2 and Fig. 3) and microanatomic level (Fig. 4, the identifications of the microanatomic regions in the liver).
Meanwhile, on such high-resolution single-cell spatial data, the regional anatomic spatial structures can be revealed by further summarizing the Leiden clustering results (heterogenous cell types) with a dimensional moving window agglomeration approach.Such approaches have been well-established in spatial data analysis of geographical information systems (GIS) data 4,5 and have recently been used for revealing spatial domains in single-cell spatial data (for example, SSAM 6 by Park et al.).Specifically, the SiGra clustering results were summarized by a circular window of diameter  sliding at both  and  directions across the whole image with a given stride length .At each stop  , with the coordinate (  ,   ), a vector  , ≡ [ 1 , … ,   ] representing the proportions of the SiGra identified clusters () covered by the sliding window was calculated.All the stops { . } were recursively merged to  groups { 1 , … ,   } by hierarchical clustering according to the cluster proportion vectors { . }.These agglomerated groups were defined as the discovered spatial domains.The original slide image was then labeled with the discovered spatial domains according to the coordinate of each stop.In this way, we obtained the spatial domains based on the heterogenous cells identified on the spatial slice.The window radius  used in our work was 100μm, which was consistent with the 10x Visium spatial resolution, with the stride  of 10μm.
The ground truth of the anatomic spatial domains of the NanoString CosMx lung cancer slide was provided by a certificated pathologist at Indiana University Health, Dr. Tieying Hou, according to the IHC images.As shown in Supplementary Fig. 5, three spatial domains were identified by Dr. Hou: the tumor region (green), the desmoplasia region (red), and the adjacent normal region (orange).For fair comparisons with other methods, the same spatial moving window agglomeration approach was used.Compared with this ground truth, SiGra achieved an ARI of 0.60, better than other methods including BayesSpace (ARI: 0.25), spaGCN (ARI: 0.10), Seurat (ARI: 0.10), stLearn (ARI: 0.10), and scanpy (ARI: 0.17).These results showed that SiGra obtained reliable spatial domains based on its identified accurate cell identities.It also indicated that the NanoString CosMx profiled cancer tissue slice was much more challenging given its strong cellular heterogeneity, large cell number, and high-resolution, compared with the 10x Visium profiled normal DLPFC tissues that had well-organized anatomic structure.
To further verify the comparison results, we also tested BayesSpace and spaGCN for direct spatial domain identification of the three domains, without using the moving window agglomeration approach.BayesSpace and spaGCN obtained ARIs of 0.15 and 0.19, respectively.These results further demonstrated that, for detecting large-scale anatomic spatial domains from single-cell spatial data, it was necessary to agglomerate the high-resolution cellular-level clustering results.ARI around 0.1, whereas SiGra achieved much better ARI (ARI: 0.7).These results showed that SiGra outperformed current available methods in recognizing single-cell spatial data.
10x Visium data: As shown in the Supplementary Fig. 8a, we evaluated the ARI scores achieved by different methods, specifically including STAGATE and MUSE.Across all 12 DLPFC slices 3 , SiGra obtained higher ARI (median ARI: 0.57) than the other methods, including stLearn (median ARI: 0.39), SpaGCN (median ARI: 0.40), and BayesSpace (median ARI: 0.44).STAGATE presented slightly lower ARI scores with median ARI only as 0.49.On slice 151669, STAGATE obtained lowest ARI (ARI: 0.27), where SiGra achieved much better ARI (ARI: 0.49).Meanwhile, MUSE also presented lower ARI scores with median value as 0.31.The highest ARI that MUSE obtained was 0.43 on slice 151669, and the lowest ARI (ARI: 0.23) was on slice 151576.These results showed that MUSE had modest performance for recognizing the spatially organized brain structures, and SiGra outperformed current available methods in recognizing the organized brain structures.

2) Ablation studies
Here we added ablation studies to investigate the contributions of the transcriptomics-based encoder-decoder (G-ED), the image-based encoder-decoder (I-ED), and the hybrid encoderdecoder (H-ED) on both single-cell spatial transcriptomics (NanoString CosMx SMI) and spot spatial transcriptomics (10x Visium) data.To further demonstrate why SiGra included an imageto-gene encoder-decoder (I-ED) instead of an image-to-image auto-encoder (I-AE), we also compared the performance of the I-AE-based and the I-ED-based ablated models.
We presented the performance of four ablated models, with only one encoder-decoder in each model.As shown in the Supplementary Fig. 8b: (1) NanoString CosMx SMI.We first evaluated the adjusted rand index (ARI) scores achieved by the ablated models on the NanoString profiled lung cancer tissue slice (Fig. 2).SiGra obtained the best ARI's across all 20 field-of-views (FOVs) (median ARI: 0.59) than the other ablated models.In contrast, without the other two components, the hybrid encoder-decoder (H-ED) alone only achieved a median ARI of 0.34.The G-ED and I-ED also presented lower ARI scores with median values of 0.40 and 0.24.I-AE obtained a lower ARI (median: 0.04) than I-ED (median: 0.24) across different FOVs.(2) 10x Visium.Then we evaluated the ARI scores achieved by the ablated models across the 12 dorsolateral prefrontal cortex (DLPFC) slices.SiGra obtained the highest ARI in all slices (median ARI: 0.57).In contrast, the H-ED presented a lower median ARI score of 0.33.The G-ED and the I-ED achieved median ARIs of 0.41 and 0.25, respectively.I-AE also presented a lower ARI (median: 0.14) than I-ED (median: 0.25).These ablation study suggested that the general imaging features extracted by an image-to-image autoencoder (I-AE) were less relevant to spatial domain detection.The imaging features that were relevant to the spatial gene expression patterns, which were extracted by the image-to-gene encoder-decoder (I-ED), proved to contribute to spatial domain identification.Additionally, the advantage of using an image-to-gene encoder-decoder rather than an image-toimage autoencoder was more significant for single-cell spatial transcriptomics data.
Collectively, these results demonstrate that the superior performance of SiGra is achieved through three encoder-decoder components.The ablated models with only the hybrid encoder-decoder H-ED or the other encoder-decoder alone are not sufficient to achieve comparable performance.

3) Simulation studies
To further verify the performance of SiGra, here we performed the simulation comparisons based on the simulation design of MUSE 8 .The only difference between our simulated data and MUSE's design was that, the simulation data were generated with spatial locations for each domain.
2) Next, the latent representations of gene expression and morphology images features:   ,   (  ∈   ,   ∈   ), were generated following the design of MUSE, where  was the size of the latent dimension.That is, for either   or   , the latent representations   of the -th cell was simulated using a multivariable normal distribution:   ~ ∑  , (  , Σ  ) , where  was the total number of the domain regions ( = 4).If cell  belonged to the -th domain, then  , =1, otherwise  , =0.  ∈   was sampled from a uniform distribution with Σ  ∈  × the identity matrix.Note that the latent gene features   and latent image features   were generated separately, that is: 3) The raw gene expression and image features were then generated by a linear transformation by   =     +  and   =     + , where  ∈  × was the random projection matrix from the uniform distribution between [-0.5, 0.5]. was the gaussian noise sampled from (0,  2 ).With additional dropouts added as below, the raw gene expression   ′ and image features   ′ were obtained: , where [.] was the indicator function, which returned 1 if the argument was true, otherwise returned 0.  was the decay coefficient that controlled dropout levels and  was the random value sampled from the uniform distribution between [0,1].
With the simulated data   ′ and   ′ , we used them as input for simulation experiments and benchmarking.
Herein, we generated simulation data including both gene-based and image-based features.We chose three settings with different dropout levels for data generation, i.e.,  as 0.2, 0.3, and 0.4.
• When  = 0.2, the generated data was visualized as below in Supplementary Fig. 9a • When  = 0.4, the generated data was visualized in Supplementary Fig. 9c.Similarly, both gene-based features and image-based features contributed to spatial domain identification at certain levels (ARI: 0.75, 0.67).SiGra maintained more accurate (ARI: 0.77) than MUSE (ARI: 0.55) and STAGATE (ARI: 0.63) in revealing spatial domains.

Supplementary Fig. 2 :
SiGra enhances spatial gene expression data from different platforms.a, Comparison of raw and enhanced data with bulk RNA-seq of lung cancer patient samples.b, Differences of enhanced-specific and raw-specific L-R pairs from shared L-R pairs (Number of L-R pairs in each group: shared: 28; raw-specific: 14; enhanced-specific: 27).In the boxplot, the center line, box limits and whiskers denote the median, upper and lower quartiles, and 1.5× interquartile range, respectively.c, Spatial visualization of all cell clusters in the singlecell spatial data from mouse liver tissue.d, Spatial visualization of the raw expressions and the enhanced expressions of Vwf.e, Comparison of enhanced data with bulk RNA-seq of mouse liver samples.f, Evaluation of DEGs using single-cell RNA-seq data of mouse liver samples.Supplementary Fig. 9: Comparisons with MUSE and STAGATE based on simulation data.(a-c), Visualizations of simulation data with different dropout levels.Spatial domains and domain clusters identified by different methods (STAGATE, MUSE, and SiGra) are shown.Gene-alone and image-alone refer to PCA results based on the single modality data.Different colors refer to the ground truth of cell types in simulation data.

.
As described above, the spatial data consisted of four spatial regions, where each of them was dominated by one major cell type with other scattered types of cells.Meanwhile, the genebased features and image-based features were shown to contribute to spatial domain identification at certain levels, with ARI as 0.54 and 0.40 respectively.However, MUSE failed to reveal clear spatial domains with ARI only as 0.28.STAGATE revealed the spatial domains with ARI as 0.58.In contrast, SiGra achieved the highest ARI (ARI: 0.71) and revealed much more accurate spatial domains.•When=0.3, the generated data was visualized in Supplementary Fig.9b.Consistently, four spatial domains were generated, with both gene-based features and image-based features contributed to spatial domain identification (ARI: 0.69, 0.60) respectively.Nevertheless, MUSE and STGATE only obtained ARI as 0.42 and 0.61, which were lower than SiGra (ARI: 0.76).